
Improve Tablet Refresh Behavior in VReplication Traffic Switch Handling #10058

Merged
merged 14 commits into from
Apr 20, 2022

Conversation

mattlord
Contributor

@mattlord mattlord commented Apr 7, 2022

Description

During operations such as SwitchWrites, we refresh the tablets in the source and target shards involved in a VReplication workflow in order to update the serving vschema and query execution rules. Let's also do that for the dry run so that we can warn users that the operation could fail. Example output:

$ time vtctlclient --server=127.0.0.1:15999 MoveTables -- --dry_run --tablet_types=rdonly,replica SwitchTraffic customer.commerce2customer

Following vreplication streams are running for workflow customer.commerce2customer:

id=1 on 0/zone1-0000000200: Status: Running. VStream Lag: 0s.

MoveTables Error: rpc error: code = Unknown desc = cannot switch traffic for workflow commerce2customer at this time: could not refresh all of the tablets involved in the operation:
failed to successfully refresh all tablets in the customer/0 target shard (<nil>):
  failed to refresh tablet zone1-0000000202: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = "transport: Error while dialing dial tcp 172.31.41.160:16202: connect: connection refused"


real	0m30.081s
user	0m0.013s
sys	0m0.005s

The topotools.RefreshTabletsByShard call is best-effort and lets the caller know if the results were incomplete (partial). We use a lower timeout within the traffic switcher when refreshing the state for each tablet (done in a goroutine) because:

  1. The default etcd lock/lease TTL is 60 seconds and if any of the tablets are unresponsive then the topo lock is likely to be lost by the time we go to release it
  2. The caller will know if the results were partial and can take appropriate action

We also add tablet refresh to the traffic switcher's pre-check call to ensure that we have healthy tablets on the source and target shards before we begin the operation; otherwise the operation is not safe and can fail. Example output:

$ time vtctlclient --server=127.0.0.1:15999 MoveTables -- --tablet_types=primary,rdonly,replica SwitchTraffic customer.commerce2customer 2>/dev/null

Following vreplication streams are running for workflow customer.commerce2customer:

id=1 on 0/zone1-0000000200: Status: Running. VStream Lag: 0s.

MoveTables Error: rpc error: code = Unknown desc = cannot switch traffic for workflow commerce2customer at this time: could not refresh all of the tablets involved in the operation:
failed to successfully refresh all tablets in the commerce/0 source shard:
  failed to refresh tablet zone1-0000000102: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = "transport: Error while dialing dial tcp 172.31.41.160:16102: connect: connection refused"


real	0m30.064s
user	0m0.008s
sys	0m0.009s

All of these changes help make VReplication workflows more predictable and more reliable, and reduce downtime.

Related Issue(s)

Checklist

@mattlord mattlord force-pushed the improve_vrepl_dry_run branch 9 times, most recently from c303c69 to 7eaec39 Compare April 11, 2022 15:41
@mattlord mattlord force-pushed the improve_vrepl_dry_run branch from 7eaec39 to 2f1f9ff Compare April 11, 2022 18:45
@mattlord mattlord changed the title Lower shard tablet refresh timeout and check this in vrepl dry runs Improve Tablet Refresh Behavior in VReplication Traffic Switch Handling Apr 11, 2022
@mattlord mattlord force-pushed the improve_vrepl_dry_run branch from 6a1489b to eb75634 Compare April 11, 2022 19:36
@mattlord mattlord force-pushed the improve_vrepl_dry_run branch 3 times, most recently from e7d1b54 to 67b1639 Compare April 15, 2022 16:51
@mattlord mattlord force-pushed the improve_vrepl_dry_run branch from 1e9f82f to 11fb7ca Compare April 19, 2022 04:54
This keeps the previous behavior of ignoring partial-refresh-related
errors, while providing the details of WHY we had a partial refresh
for anyone who's interested.

Signed-off-by: Matt Lord <[email protected]>
@mattlord mattlord force-pushed the improve_vrepl_dry_run branch from 11fb7ca to 18f9573 Compare April 19, 2022 05:34
@mattlord mattlord marked this pull request as ready for review April 19, 2022 05:40
@mattlord
Contributor Author

I ended up changing the function signature for topotools.RefreshTabletsByShard() so we can easily see where it's being called in the PR changes. The only thing left is going through each of those call sites to see if we should START checking for partial results (they were ignored before). @rohit-nayak-ps or @deepthi, happy to walk through that if either of you have opinions. The window where we could check and cancel is between:

  • The pre-check:
    reason, err := vrw.canSwitch(keyspace, workflowName)
    if err != nil {
        return nil, err
    }
    if reason != "" {
        return nil, fmt.Errorf("cannot switch traffic for workflow %s at this time: %s", workflowName, reason)
    }
  • The point of no return, where we create the journal records:
    • For SwitchReads: ... not clear we have one
    • For SwitchWrites:
      // This is the point of no return. Once a journal is created,
      // traffic can be redirected to target shards.
      if err := sw.createJournals(ctx, sourceWorkflows); err != nil {
          ts.Logger().Errorf("createJournals failed: %v", err)
          return 0, nil, err
      }

@rohit-nayak-ps rohit-nayak-ps (Contributor) left a comment:

lgtm

@@ -121,7 +137,7 @@ func UpdateShardRecords(
 	// For 'to' shards, refresh to make them serve. The 'from' shards will
 	// be refreshed after traffic has migrated.
 	if !isFrom {
-		if _, err := RefreshTabletsByShard(ctx, ts, tmc, si, cells, logger); err != nil {
+		if _, _, err := RefreshTabletsByShard(ctx, ts, tmc, si, cells, logger); err != nil {
mattlord (Contributor Author):

This is called from a number of places via Wrangler.updateShardRecords(). I don't think that we can change the partial handling here w/o other changes.

This way the caller has that info and can decide what to do with
it (if anything).

Signed-off-by: Matt Lord <[email protected]>
@@ -354,7 +354,7 @@ func (wr *Wrangler) cancelHorizontalResharding(ctx context.Context, keyspace, sh

 	destinationShards[i] = updatedShard

-	if _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil {
+	if _, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil {
mattlord (Contributor Author):

Doesn't make sense to return an error when canceling.

@@ -442,7 +442,7 @@ func (wr *Wrangler) MigrateServedTypes(ctx context.Context, keyspace, shard stri
 		refreshShards = destinationShards
 	}
 	for _, si := range refreshShards {
-		_, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, cells, wr.Logger())
+		_, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, cells, wr.Logger())
mattlord (Contributor Author):

I believe this is past the point of no return.

@@ -792,7 +792,7 @@ func (wr *Wrangler) masterMigrateServedType(ctx context.Context, keyspace string
 	}

 	for _, si := range destinationShards {
-		if _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil {
+		if _, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil {
mattlord (Contributor Author):

I believe this is past the point of no return.

@@ -1226,7 +1226,7 @@ func (wr *Wrangler) replicaMigrateServedFrom(ctx context.Context, ki *topo.Keysp

 	// Now refresh the source servers so they reload the denylist
 	event.DispatchUpdate(ev, "refreshing sources tablets state so they update their denied tables")
-	_, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, sourceShard, cells, wr.Logger())
+	_, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, sourceShard, cells, wr.Logger())
mattlord (Contributor Author):

I believe this is past the point of no return.

-	_, err := topotools.RefreshTabletsByShard(ctx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger())
+	rtbsCtx, cancel := context.WithTimeout(ctx, shardTabletRefreshTimeout)
+	defer cancel()
+	_, _, err := topotools.RefreshTabletsByShard(rtbsCtx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger())
mattlord (Contributor Author):

This is called when canceling.

-	_, err := topotools.RefreshTabletsByShard(ctx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger())
+	rtbsCtx, cancel := context.WithTimeout(ctx, shardTabletRefreshTimeout)
+	defer cancel()
+	_, _, err := topotools.RefreshTabletsByShard(rtbsCtx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger())
@mattlord mattlord (Contributor Author) commented Apr 19, 2022:

This is a safe/good place to return an error if we could not refresh all tablets, as it's early on in the operation. I'll work on this one.

Member:

sgtm

mattlord (Contributor Author):

Done: 90f656f

-	_, err := topotools.RefreshTabletsByShard(ctx, ts.TopoServer(), ts.TabletManagerClient(), target.GetShard(), nil, ts.Logger())
+	rtbsCtx, cancel := context.WithTimeout(ctx, shardTabletRefreshTimeout)
+	defer cancel()
+	_, _, err := topotools.RefreshTabletsByShard(rtbsCtx, ts.TopoServer(), ts.TabletManagerClient(), target.GetShard(), nil, ts.Logger())
mattlord (Contributor Author):

This is after the point of no return.

@deepthi deepthi (Member) left a comment:

Nice work! lgtm

@mattlord mattlord merged commit b5ca7a6 into vitessio:main Apr 20, 2022
@mattlord mattlord deleted the improve_vrepl_dry_run branch April 20, 2022 19:22
Labels
Component: Cluster management Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)